Project: Investigate a Dataset (TMDb movie data)

Table of Contents

Introduction

We are going to use TMDb data; this data set contains informationabout 10,000 movies collected fromThe Movie Database (TMDb), including user ratings and revenue.

  • Certain columns, like ‘cast’and ‘genres’, contain multiplevalues separated by pipe (|) characters.
  • There are some odd charactersin the ‘cast’ column. We are going to clean them.
  • The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2015 dollars,accounting for inflation overtime.

We will download the data in csv format to be used as a data frame with the pandas package and we will proceed to select the columns that will be significant for our analysis, proceeding to clean the data and then explore the data trying to give some questions and make a summary in the conclusion section.

We are going to anwser the follow questions:

Data Wrangling

We are going to use the follow packages: pandas, numpy, matplotlib and seaborn

Load the data and look the some rows

Let's see what is the size of the data frame and some other information about the data.

The data Frame has 10,866 rows and 21 columns.

We see that there are varibles with na's values that we are going to clean them

Let's create a new column call profit that would be the difference between revenue_adj and budget_adj in movies data and divide by a million budget and revenue

We are going to work with the budget_adj and revenue_adj which are adjusted for inflation. The mean of those value is a big numbers, so we are going to pass to a million this columns to make this number more easy to read and we'll create profit column (revenue_adj - budget_adj)

We see that budget_adj and revenue_adj have cero value. It doesn't make sense so we are going to fix that later on.

Let's see duplicated rows

There is one duplicate row

The number of id value is one less than the number of row so there is one duplicated id movies. Maybe there can be more movies with the same title, but in any case they would be different movies.We are gonna drop the duplcated row

Let's check how many na's are in the data.

There are columns with hight numbers of na's, so we have to clean them.

Data Cleaning

We are going to remove some columns of movie data that are not significant for our analysis. We will inspect the na's that are still in the movie data to see which method will be more appropriate to clean up the na's, substituting the average of the same column or perhaps removing the specific row. We'll check for duplicate values to remove them, create a new column if needed, and rename columns if needed.

We will drop out from movies, the follows columns: id, runtime, homepage, tagline,keywords, original_title, overview, budget and revenue

Those variables are no significant for our analisis.

Let's check if budget and revenue have values with cero.

As we can see there are a lot of cero values in budget_adj and revenue_adj, we could drop them out but that data represent more than 50 % of our total data or, we can take the mean in the column and replace the ceros by the mean.

For this example we are going to replace the ceros values by the mean.

Let's replace ceros values by na's and after that replace by the mean of the column

Check again the total na's in the data frame.

About 10% of data remains with na's in production_companies but, I want to use this column, so we are going to drop those na's.

We want to make some analysis with production companies and with the rest of variable that still with na's values so we are going to drop all the rows with na's

Now there are not na's in the data, so the data has 9772 rows and 13 columns

Convert the string format in release_date to datatime in movies data

we convert that column to be able to order by date.

Now, the format in the column is datatime, so we can order and operate with that

Now, We have a clean data frame, so let's look to explore the data

Exploratory Data Analysis

*Now that we've trimmed and cleaned our data, I'm ready to move on to exploration. We are going to compute statistics and create some visualizations with the goal to answer some questions. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

Before we answer the question we are going to do some exploratory analisys to give us and idea about the data.

Let's make a histogram plot to take a look

Firstly we see in release_year is skew left that indicate that the number of movies has increased over time, there are few movies with more than 5 of popularity, vote_average is quite normal with the mean close to 6, most movies have a budget_adj of less than 100 millions and skew right, profit and revenue_adj have the same shape

Let's make a pairplot to try to see how the variables are releted each other.

Research Question 1 (How popularity can afect adjested revenue and profit by movie?)

We are goin to make a plot to try to anwser this question.

We see a strong relationship between popularity and adjusted revenue

Let's create a plot to see how much popularity afect profit by movie

High popularity is associated with high profits and large budgets.

Research Question 2 (How high budgets can afect popularity?)

We are goin to make a plot to try to anwser this question. Let's explore the relation between budget_adj and popularity Making a plot to try to get it

The regresion line shows a smooth positive relation among those variable, It seem that movies with hight budget have a better popularity.

What are the 20 movies with more vote_count?

We see that movie with high number of vote_count have high vote_average

Research Question 3 (What are the 20 most profitable movies of all time?)

The most profitable movie in all time is Star Wars, the second place is Avatar and Titanic in third place.

Research Question 4 (Large budgets produce large revenues?)

Let's make a plot to represent the relation between budget_adj and revenue_adj

We are going to use all the data

It seems that a big budget has and positive impact in the increase of the revenues.

Research Question 5 (What is the total number of movies by year?)

There is an increse in the number of total movies over time

In general, there was a constant increase in the number of movies along the way, in 2010 and 2015 there were a drop in the number of movies.

Research Question 6 (Who are the top 20 directors by profit?)

Steven Spielberg is the most profitable director over time

Let's plot the top 20 directors by profit

Research Question 7 (Who are the top 10 directors by popularity?)

Get the plot the top 10 Director by popularity)

The top 10 directors with more movies

Woody Allen is the most prolific director with 42 movies

Let's create a function to split the string values in a column in multiple rows : cast, genres and production_companies and create new colums as we need them.¶

We are going to create 3 dataframes where we split the data by cast, genres and production_companies from movies and after that, make some analysis and plots.

We have created 3 new data frame with more row tahn movies data and with the same number of columns.

We must take into account that our analysis is partial, since a film can have several genres; for example, 'Jurassic World' has four genres. 'Action', 'Adventure', 'Science Fiction', 'Thriller'. What percentage of each is involved in the film? we really don't know. So for example, when we calculate the number of movies that have a given genre, all the genres involved in that movie will count as one movie made.

We must take into account that our analysis is partial, since a film can have several genres; for example, 'Jurassic World' has four genres. 'Action', 'Adventure', 'Science Fiction', 'Thriller'. What percentage of each is involved in the film? we really don't know. So when we calculate the number of movies that have a given genre, all the genres involved in that movie will count as one movie made. It also happens when we take the average popularity since all the genres of a specific movie are going to have the same popularity. So this is a limitation of this work.

We see that in 2015 aventure genre had the highest mean popularity, in 2014 was Science Fiction.

How many times the genre appears first over time?

We see Animation appears 13 time in firts place, Adventure 11 times in the first place and Fantasy 9

Now we are going to get the top ten genres by year over time

Let's make a plot where we can see the evolution over time of the 10 most common genres.

Research Question 9 (What are the most common genres with the greatest number of films over time?)

We see the Drame follow by Comedy, Thriller, Action are the genres with more movies in total in the whole data.

Let's make a plot to capture the total movies by genre over time.

Drama is by far the genre with more movies follow by Comedy, Thriller and so on.

We'll make a line plot with the evolución over time of the number of movies by genres.

We have to be aware here that the same movie can be categorized with more the one genre, so we are going to get a plot when the sum of all number of movie of each genre is goin to be greather than the toyal number of movie that year.

Drama is the genre with the the largest and has the fastest growing number of movies, follow by comedy, thriller and action. Almost all genres had an increase in the number of movies over time except Western, History, War and Foreign. We'll do the same analisys but using this time porcentage of each genres over the total by year.

Let's get the porcentage of each genre

Wester and War have a decrease in the evolution over time of the number of movies released as a percentage of the total number of movie

Research Question 10 (What are the 20 most profitable genres by movie over time?)

We are going to see which is in average the most profitable genre over time

Adventure is the genre has acumulated most profits by movie over time, follow by Animation and Fantasy.

Let's create a data frame call companies_metric grouping by the name of the company where we agregate the most interesting variable together

We have the same number of rows than the number of unique companies

There are 7,842 different companies

Research Question 11 (What are the 20 most profitable companies over time in US dollars?)

We can see at the same time how many movies they did to get the total profits.

Warnes Bros is the company that has made more total profits over time follow by Universal Pictures and the Paramount Pictures. It is interesting to see that companies like Lucasfilm and Marvel Studios have made a lot of profits with few movies, that tell us that those companies had to do less effort to get big profits. In fact those kind of companies have a profit by movie higher than the Warner Bros.

We see that there is a big corralation between total profit (profit) and total number of movies (original_title) but is so low with profit/movie (0.030805) and that is because there are a lot of companies with few movies but with big profit by movie

Let's make a chart showing the 20 most profitable companies over time in US dollars¶

Warner Bros in the most profitable company over tiem, follow by Universal Pictures and Paramout Pictures

let's make a graph showing the total number of movies that each company did vs total profit.

The idea is to see in a plot this relation between those variabels

The plot shows a strong corralation that we have calculated before. Companies that have done more movies acumulated more profit over time. There are a lot of dots in the quadrant 50 movies and 5000 total profit.

Research Question 12 (What is the relationship between profit per movie and the total number of movies made?)

we are going to choose companies with at least 5 movies to see how much return by movie those companies get.

We see that companies with few movies have gotten better return (profit by movie).

Let's make a scatter plot to capture profit by movie vs number of movies represented by sized dots as a function of companies' total profit over time

Most of the points are located in the quadrant formed by profit/movie (300 million per movie) and number of movies (100 movies), all of them with a low total profit over time (small points), less than 20,000 million. We see a lot of points with higher profit more than 300 millions by movie with low number of movie and there are few big dots all of them with profit by movie of less than 200 millons by movie, so the five principles companies more profitable in total over time are not the companies with more profit by movie.

Research Question 13 (The top 20 actors with the most movies)

Robert De Niro with 72 movies is in the first place folowed by Samuel L. Jackson with 70 movies, Bruce Willis, Nicolas Cage and Michael Caine. You can see the complete list above.

The first five profitable actors are Harrison Ford followed by Tom Hanks, Tom Cruise, Mark Hamill, and Carrie Fisher. we can see Leonardo DiCaprio in the 19th position.

Let's get the top 10 vote_average actors

To be consistent with the data, we're only going to consider actors with at least 30 movies to get the rank, which gives us a sense of performance over time

There are few differences in the vote_average among those 10 actor

Research Question 14 (What kinds of properties are associated with movies that have high revenues?)

To answer this question we have to make a summary of the analysis carried out.

Conclusions

Limitation in our analisys

We have had some limitations in our analysis. First, at data cleanup time, we removed 1,030 rows where production_companies appeared with na's approximately 10% of our data, perhaps it would have been more appropriate to search for that information on the internet. We also found over 5,000 rows where budget and revenue had a zero value (more than 50% of the data), so we decided to take the average of those columns and replace the zero values ​​with the mean. Being such a high percentage of rows, the most logical thing would be to try to search for that data on the internet and replace it with real values since this distorts the analysis, but for the purposes of carrying out this project we decided to implement the average. Second, since a movie can have several genres; for example, 'Jurassic World' has four genres. 'Action', 'Adventure', 'Science Fiction', 'Thriller'. What percentage of each is involved in the film? we really don't know. So when we calculate the number of movies that have a given genre, all the genres involved in that movie will count as one movie made, which is another limitation of our analysis. Something similar happens when we take the average popularity of the genres, since all the genres of a specific movie are going to have the same popularity, so this is another limitation of this work and the same applies to the actors or actresses, since they do not we have an order of importance of them.

References